    Metadata Exploitation in Large-scale Data Migration Projects

    The inherent complexity of large-scale information integration efforts has led to the proliferation of metadata capabilities aimed at improving project management, quality control, and governance. In this paper, we use complex information integration projects in the context of SAP application consolidation to analyze several new metadata capabilities that enable improved governance and control of data quality. Further, we identify key focus areas for shaping future industrial and academic research by investigating unaddressed aspects of these capabilities that often negatively impact information integration projects.

    Ontology-guided Reference Data Alignment in Information Integration Projects

    One of the hard problems in information integration projects (harmonizing data from various legacy sources into one or more targets) is the appropriate alignment of reference data values across systems. Without this alignment, the process of loading records into the target systems might fail, because the target might reject any record with an unknown reference data value or different underlying data semantics. Today, detecting reference data tables and determining the relative alignment between a source and a target is largely manual, cumbersome, error-prone, and costly. We propose a novel ontology-guided approach to detect reference data tables and their relative alignment across source and target systems, enabling the semi-automated creation of translation tables.
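
    As a rough illustration of the translation tables this abstract targets, the sketch below aligns two hypothetical reference-data tables (country codes) by matching their descriptions. In the paper's approach the matching step is ontology-guided rather than an exact label comparison; all names and data here are invented for illustration.

```python
# Hypothetical source and target reference-data tables: code -> description.
source_codes = {"US": "United States", "DE": "Germany", "FR": "France"}
target_codes = {"USA": "United States", "DEU": "Germany", "FRA": "France"}

def build_translation_table(source, target):
    """Build a source-code -> target-code translation table by matching
    descriptions. Exact label equality stands in here for the ontology
    lookup described in the abstract; unmatched codes are reported so a
    human can resolve them (the "semi-automated" part)."""
    by_label = {label: code for code, label in target.items()}
    table, unmatched = {}, []
    for code, label in source.items():
        if label in by_label:
            table[code] = by_label[label]
        else:
            unmatched.append(code)
    return table, unmatched

table, unmatched = build_translation_table(source_codes, target_codes)
```

    A real alignment would also have to handle synonyms and differing granularity across systems, which is exactly where the shared ontology earns its keep.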

    Motif Recognition

    The problem of recognizing motifs in biological data has been well studied, and numerous algorithms, both exact and approximate, have been proposed to address it. We strongly believe that the open availability and easy accessibility of quality implementations of such algorithms are critical to the research community, so that results from other studies can be directly reproduced and utilized rather than reinvented. Moreover, it is also important for an implementation to be as generic as possible, so that any researcher can extend it with minimal effort to test a newly implemented algorithmic extension or heuristic. With this motivation, we focus on an existing algorithm, PatternBranching, and, to a lesser degree, Yang2004. We analyze these approaches for minor heuristic changes and speed-ups by adjusting certain thresholds, and implement the resulting variant in a high-level language (Java) using sound programming practices and generic, extensible interfaces. We also analyze the performance of PatternBranching using a synthetically generated test suite for a variety of sequence lengths and report the results. Code from this project will be made freely available online to the research community.
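
    For context, PatternBranching addresses the planted (l, d)-motif problem: find an l-mer that occurs in every sequence with at most d mismatches. The sketch below is a minimal brute-force solver for that problem, assuming a DNA alphabet; it is not the paper's Java implementation, and PatternBranching gains its speed precisely by branching outward from sample strings instead of enumerating all candidates as done here.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs_with_mismatches(motif, seq, d):
    """True if some substring of seq matches motif with at most d mismatches."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))

def find_motifs(seqs, l, d, alphabet="ACGT"):
    """Exhaustive (l, d)-motif search: every l-mer over the alphabet that
    occurs in all sequences with at most d mismatches. Exponential in l,
    so only feasible for small l; faster algorithms avoid this enumeration."""
    return ["".join(cand) for cand in product(alphabet, repeat=l)
            if all(occurs_with_mismatches("".join(cand), s, d) for s in seqs)]

# Toy example: "ACGT" is planted (with up to 1 mismatch) in each sequence.
motifs = find_motifs(["ACGTACGT", "TTACGTTT", "GGACGAGG"], l=4, d=1)
```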

    ANEXdb: An Integrated Animal Annotation and Microarray EXpression Database

    All publicly available porcine expressed sequences were assembled to create longer, fuller transcripts for annotation purposes. The longer sequences were then used as queries in sequence alignment and comparison software to transfer functional annotation from their homologues in other species. In addition to the transferred annotation, sequence variation was also predicted from the assembly. This information can then be combined with expression data from high-throughput expression measures, such as microarrays, to more fully understand the underlying mechanisms of biological processes. Both kinds of data, expression and annotation, are housed together and available at www.anexdb.org.

    Ontology-guided extraction of structured information from unstructured text: Identifying and capturing complex relationships

    Many applications call for methods to enable automatic extraction of structured information from unstructured natural language text. Due to the inherent challenges of natural language processing, most existing methods for information extraction from text tend to be domain specific. This thesis explores a modular ontology-based approach to information extraction that decouples domain-specific knowledge from the rules used for information extraction. Specifically, the thesis describes: 1. A framework for ontology-driven extraction of a subset of nested complex relationships (e.g., Joe reports that Jim is a reliable employee) from free text. The extracted relationships are semantically represented as RDF (Resource Description Framework) graphs, which can be stored in RDF knowledge bases and queried using RDF query languages. 2. An open-source implementation of SEMANTIXS, a system for ontology-guided extraction and semantic representation of structured information from unstructured text. 3. Results of experiments that offer evidence of the utility of the proposed ontology-based approach for extracting complex relationships from text.
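
    The nested relationship in the abstract's example ("Joe reports that Jim is a reliable employee") can be captured with standard RDF reification, where the inner statement becomes a resource that the outer triple points to. Below is a minimal sketch using plain triples; the URIs and predicate names are hypothetical illustrations, not SEMANTIXS output.

```python
def reify(subject, predicate, obj, statement_id):
    """Reify an RDF statement using the rdf:Statement vocabulary, so the
    statement itself can appear as the object of another triple."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]

# Inner assertion: Jim is a reliable employee (names are hypothetical).
graph = reify(":Jim", ":hasRole", ":ReliableEmployee", ":stmt1")
# Outer assertion: Joe reports that inner statement.
graph.append((":Joe", ":reports", ":stmt1"))
```

    A real system would emit these triples into an RDF store and query them with SPARQL; plain tuples are used here only to keep the sketch self-contained.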

    On a robust document classification approach using TF-IDF scheme with learned, context-sensitive semantics.

    Document classification is a well-known task in the information retrieval domain and relies upon various indexing schemes to map documents into a form that can be consumed by a classification system. Term Frequency-Inverse Document Frequency (TF-IDF) is one such class of term-weighting functions used extensively for document representation. A major drawback of this scheme is that it ignores key semantic links between words and/or word meanings and compares documents based solely on word frequencies. The majority of current approaches that try to address this issue either rely on alternate representation schemes or are based upon probabilistic models. We utilize a non-probabilistic approach to build a robust document classification system, which enriches the classical TF-IDF scheme with context-sensitive semantics using a neural-net-based learning component.